Simulation-based optimization of Markov reward processes

Authors

  • Peter Marbach
  • John N. Tsitsiklis
Abstract

We propose a simulation-based algorithm for optimizing the average reward in a Markov Reward Process that depends on a set of parameters. As a special case, the method applies to Markov Decision Processes where optimization takes place within a parametrized set of policies. The algorithm involves the simulation of a single sample path, and can be implemented on-line. A convergence result (with probability 1) is provided.

This research was supported by contracts with Siemens AG, Munich, Germany, and Alcatel Bell, Belgium, and by a contract (DMI) with the National Science Foundation.

Introduction

Markov Decision Processes and the associated dynamic programming (DP) methodology [Ber], [Put] provide a general framework for posing and analyzing problems of sequential decision making under uncertainty. DP methods rely on a suitably defined value function that has to be computed for every state in the state space. However, many interesting problems involve very large state spaces ("curse of dimensionality"). In addition, DP assumes the availability of an exact model, in the form of transition probabilities. In many practical situations, such a model is not available, and one must resort to simulation or experimentation with an actual system. For all of these reasons, dynamic programming in its pure form may be inapplicable.

The efforts to overcome the aforementioned difficulties involve two main ideas:

  • The use of simulation to estimate quantities of interest, thus avoiding model-based computations.
  • The use of parametric representations to overcome the curse of dimensionality.

Parametric representations, and the associated algorithms, can be broadly classified into three main categories.

(a) Parametrized value functions. Instead of associating a value $V(i)$ with each state $i$, one uses a parametric form $\tilde V(i, r)$, where $r$ is a vector of tunable parameters (weights) and $\tilde V$ is a so-called approximation architecture. For example, $\tilde V(i, r)$ could be the output of a multilayer perceptron with weights $r$ when the input is $i$. Other representations are possible, e.g., involving polynomials, linear combinations of feature vectors, state aggregation, etc. When the main ideas from DP are combined with such parametric representations, one obtains methods that go under the names of "reinforcement learning" or "neuro-dynamic programming" (see [BT], [SB] for textbook expositions, as well as the references therein). A key characteristic is that policy optimization is carried out in an indirect fashion: one tries to obtain a good approximation of the optimal value function of dynamic programming, and uses it to construct policies that are close to optimal. Such methods are reasonably well, though not fully, understood, and there have been some notable practical successes (see [BT], [SB] for an overview), including the world-class backgammon player by Tesauro [Tes].

(b) Parametrized policies. In an alternative approach, which is the one considered in this paper, the tuning of a parametrized value function is bypassed. Instead, one considers a class of policies described in terms of a parameter vector $\theta$. Simulation is employed to estimate the gradient of the performance metric with respect to $\theta$, and the policy is improved by updating $\theta$ in a gradient direction. In some cases, the required gradient can be estimated using IPA (infinitesimal perturbation analysis); see, e.g., [HC], [Gla], [CR], and the references therein. For general Markov processes, and in the absence of special structure, IPA is inapplicable, but gradient estimation is still possible using likelihood-ratio methods [Gly], [GG], [LEc], [GI].

(c) Actor-critic methods. A third approach, which is a combination of the first two, includes parametrizations of the policy (actor) and of the value function (critic) [BSA]. While such methods seem particularly promising, theoretical understanding has been limited to the impractical case of lookup representations (one parameter per state) [KB].

This paper concentrates on methods based on policy parametrization and (approximate) gradient improvement, in the spirit of item (b) above. While we are primarily interested in the case of Markov Decision Processes, almost everything applies to the more general case of Markov Reward Processes that depend on a parameter vector, and we proceed within this broader context.

We start with a formula for the gradient of the performance metric that has been presented, in different forms and for various contexts, in [Gly], [CC], [FH], [JSJ], [TH], [CW]. We then suggest a method for estimating the terms that appear in that formula. This leads to a simulation-based method that updates the parameter vector at every regeneration time, in an approximate gradient direction. Furthermore, we show how to construct an on-line method that updates the parameter vector at each time step. The resulting method has some conceptual similarities with those described in [CR] (that reference assumes, however, the availability of an IPA estimator with certain guaranteed properties that are absent in our context) and in [JSJ] (which, however, does not contain convergence results).

The method that we propose only keeps in memory, and updates, 2K + 1 numbers, where K is the dimension of $\theta$. Other than $\theta$ itself, this includes a vector similar to the "eligibility trace" in Sutton's temporal difference methods and, as in [JSJ], an estimate of the average reward under the current value of $\theta$. If that estimate were exact, our method would be a standard stochastic gradient algorithm. However, as $\theta$ keeps changing, it is generally a biased estimate of the true average reward, and the mathematical structure of our method is more complex than that of stochastic gradient algorithms. For reasons that will become clearer later, standard approaches (e.g., martingale arguments or the ODE approach) do not seem to suffice for establishing convergence, and a more elaborate proof is necessary.

Our gradient estimator can also be derived or interpreted in terms of likelihood ratios [Gly], [GG]. It takes the same form as one presented in [Gly], but it is used differently. The development in [Gly] leads to a consistent estimator of the gradient, assuming that a very large number of regenerative cycles are simulated while keeping the policy parameter $\theta$ at a fixed value. Presumably, $\theta$ would then be updated after such a long simulation. In contrast, our method updates $\theta$ much more frequently, and retains the desired convergence properties despite the fact that any single cycle results in a biased gradient estimate.

An alternative simulation-based stochastic gradient method, again based on a likelihood-ratio formula, has been provided in [Gly], and uses the simulation of two regenerative cycles to construct an unbiased estimate of the gradient. We note some of the differences with the latter work. First, the methods in [Gly] involve a larger number of auxiliary quantities that are propagated in the course of a regenerative cycle. Second, our method admits a modification (developed in the later sections on variants) that can make it applicable even if the time until the next regeneration is excessive, a case in which likelihood-ratio-based methods suffer from excessive variance. Third, our estimate of the average reward is obtained as a weighted average of all past rewards, not just those of the last regenerative cycle. In contrast, an approach such as the one in [Gly] would construct an independent estimate of the average reward during each regenerative cycle, which should result in higher variance. Finally, our method brings forth, and makes crucial use of, the value (differential reward) function of dynamic programming. This is important because it paves the way for actor-critic methods, in which the variance associated with the estimates of the differential rewards is potentially reduced by means of "learning" (value function approximation). Indeed, subsequent to the first writing of this paper, this latter approach has been pursued in [KT], [SMS].

In summary, the main contributions of this paper are as follows.

  • We introduce a new algorithm for updating the parameters of a Markov Reward Process on the basis of a single sample path. The parameter updates can take place either during visits to a certain recurrent state, or at every time step. We also specialize the method to Markov Decision Processes with parametrically represented policies. In this case, the method does not require the transition probabilities to be known.
  • We establish that the gradient (with respect to the parameter vector) of the performance metric converges to zero with probability 1, which is the strongest possible result for gradient-related stochastic approximation algorithms.
  • The method admits approximate variants with reduced variance, such as the one described in a later section, as well as various types of actor-critic methods.

The remainder of this paper is organized as follows. We first introduce our framework and assumptions, and state some background results, including a formula for the gradient of the performance metric. We then present an algorithm that performs updates during visits to a certain recurrent state, present our main convergence result, and provide a heuristic argument. Two subsequent sections deal with variants of the algorithm that perform updates at every time step. We then specialize our methods to the case of Markov Decision Processes that are optimized within a (possibly restricted) set of parametrically represented randomized policies, present some numerical results, and conclude. The lengthy proof of our main results is developed in the appendices.

Markov Reward Processes Depending on a Parameter

In this section, we present our general framework, make a few assumptions, and state some basic results that will be needed later. We consider a discrete-time, finite-state Markov chain $\{i_n\}$ with state space $S = \{1, \ldots, N\}$, whose transition probabilities depend on a parameter vector $\theta \in \mathbb{R}^K$ and are denoted by
$$p_{ij}(\theta) = P\big(i_{n+1} = j \mid i_n = i;\, \theta\big).$$
Whenever the state is equal to $i$, we receive a one-stage reward that also depends on $\theta$ and is denoted by $g_i(\theta)$.

For every $\theta \in \mathbb{R}^K$, let $P(\theta)$ be the stochastic matrix with entries $p_{ij}(\theta)$, let $\mathcal{P} = \{P(\theta) \mid \theta \in \mathbb{R}^K\}$ be the set of all such matrices, and let $\bar{\mathcal{P}}$ be its closure. Note that every element of $\bar{\mathcal{P}}$ is also a stochastic matrix and, therefore, defines a Markov chain on the same state space. We make the following assumptions.

Assumption 1. The Markov chain corresponding to every $P \in \bar{\mathcal{P}}$ is aperiodic. Furthermore, there exists a state $i^*$ which is recurrent for every such Markov chain.

We will often refer to the times that the state $i^*$ is visited as regeneration times.

Assumption 2. For every $i, j \in S$, the functions $p_{ij}(\theta)$ and $g_i(\theta)$ are bounded, twice differentiable, and have bounded first and second derivatives.

The performance metric that we use to compare different policies is the average reward criterion $\eta(\theta)$, defined by
$$\eta(\theta) = \lim_{t \to \infty} \frac{1}{t}\, E_\theta\!\left[\sum_{k=0}^{t-1} g_{i_k}(\theta)\right].$$
Here, $i_k$ is the state at time $k$, and the notation $E_\theta[\cdot]$ indicates that the expectation is taken with respect to the distribution of the Markov chain with transition probabilities $p_{ij}(\theta)$. Under Assumption 1, the average reward $\eta(\theta)$ is well defined for every $\theta$ and does not depend on the initial state. Furthermore, the balance equations
$$\sum_{i=1}^{N} \pi_i(\theta)\, p_{ij}(\theta) = \pi_j(\theta), \quad j = 1, \ldots, N, \qquad \sum_{i=1}^{N} \pi_i(\theta) = 1,$$
have a unique solution $\pi(\theta) = (\pi_1(\theta), \ldots, \pi_N(\theta))$, the vector of steady-state probabilities, and the average reward can be written as $\eta(\theta) = \sum_{i=1}^{N} \pi_i(\theta)\, g_i(\theta)$.
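For orientation, a common form of the gradient formula referred to above, written in the notation of this section, is the following background sketch ($v_j(\theta)$ denotes the differential reward of state $j$ and $T$ the first return time to the recurrent state $i^*$, both standard dynamic-programming quantities):
$$\nabla \eta(\theta) = \sum_{i \in S} \pi_i(\theta) \Big( \nabla g_i(\theta) + \sum_{j \in S} \nabla p_{ij}(\theta)\, v_j(\theta) \Big),
\qquad
v_j(\theta) = E_\theta\!\left[ \sum_{k=0}^{T-1} \big( g_{i_k}(\theta) - \eta(\theta) \big) \,\Big|\, i_0 = j \right].$$
Estimating these terms from a single simulated trajectory, with the differential rewards replaced by rewards accumulated between regenerations, is roughly what the method outlined in the introduction does.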
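Regarding the claim that, for Markov Decision Processes with parametrically represented randomized policies, the method does not require knowledge of the transition probabilities: a standard way to see this (sketched here in assumed notation $\mu_\theta(u \mid i)$ for the probability of choosing action $u$ in state $i$) is that all $\theta$-dependence enters through the policy. On the chain of state-action pairs, the likelihood-ratio term for a transition ending in state $j$ with chosen action $v$ is
$$\frac{\nabla \mu_\theta(v \mid j)}{\mu_\theta(v \mid j)},$$
which can be computed from the policy parametrization alone; the unknown transition probabilities $p_{ij}(u)$ cancel out of the ratio.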
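To make the single-sample-path idea concrete, here is a minimal Python sketch of the kind of update loop described in the introduction: an eligibility trace of likelihood-ratio terms, a running estimate of the average reward, and a gradient step on $\theta$ at every transition. The interfaces p, grad_p, g, grad_g, the constant step size, and the trace-reset convention are assumptions made for illustration; this is not the paper's exact algorithm.

import numpy as np

def policy_gradient_sketch(p, grad_p, g, grad_g, theta, i_star=0,
                           steps=100_000, gamma=1e-3, seed=0):
    # Assumed interfaces (illustrative, not from the paper):
    #   p(theta)      -> (N, N) transition matrix
    #   grad_p(theta) -> (N, N, K) gradients of the transition probabilities
    #   g(theta)      -> (N,) one-stage rewards
    #   grad_g(theta) -> (N, K) gradients of the one-stage rewards
    rng = np.random.default_rng(seed)
    K = theta.size
    z = np.zeros(K)      # eligibility trace of likelihood-ratio terms
    eta_hat = 0.0        # running estimate of the average reward
    i = i_star           # start at the recurrent state i*
    for _ in range(steps):
        P = p(theta)
        j = rng.choice(P.shape[0], p=P[i])      # simulate one transition
        inc = grad_p(theta)[i, j] / P[i, j]     # likelihood-ratio term
        z = inc if j == i_star else z + inc     # reset the trace at regenerations
        r = g(theta)[j]
        # approximate gradient step; (r - eta_hat) * z plays the role of the
        # differential-reward contribution accumulated along the sample path
        theta = theta + gamma * (grad_g(theta)[j] + (r - eta_hat) * z)
        eta_hat += gamma * (r - eta_hat)        # track the average reward
        i = j
    return theta, eta_hat

The difficulty addressed by the paper's analysis is that $\theta$ and the average-reward estimate change at every step, so individual increments are biased; convergence of the gradient of the performance metric to zero is nevertheless established.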


Related articles

Simulation-based optimization of Markov decision processes: An empirical process theory approach

We generalize and build on the PAC Learning framework for Markov Decision Processes developed in Jain and Varaiya (2006). We consider the reward function to depend on both the state and the action. Both the state and action spaces can potentially be countably infinite. We obtain an estimate for the value function of a Markov decision process, which assigns to each policy its expected discounted...


Simulation-Based Optimization of Markov

We propose a simulation-based algorithm for optimizing the average reward in a Markov Reward Process that depends on a set of parameters. As a special case, the method applies to Markov Decision Processes where optimization takes place within a parametrized set of policies. The algorithm involves the simulation of a single sample path, and can be implemented on-line. A convergence result (with ...


Simulation-Based Optimization of Markov Reward Processes: Implementation Issues

We consider discrete-time, finite-state-space Markov reward processes which depend on a set of parameters. Previously, we proposed a simulation-based methodology to tune the parameters to optimize the average reward. The resulting algorithms converge with probability 1, but may have a high variance. Here we propose two approaches to reduce the variance, which however introduce a new bias into t...


A State Aggregation Approach to Singularly Perturbed Markov Reward Processes

In this paper, we propose a single-sample-path-based algorithm with state aggregation to optimize the average reward of singularly perturbed Markov reward processes (SPMRPs) with large-scale state spaces. It is assumed that such a reward process depends on a set of parameters. Differing from other kinds of Markov chains, SPMRPs have their own hierarchical structure. Based on this special s...


COVARIANCE MATRIX OF MULTIVARIATE REWARD PROCESSES WITH NONLINEAR REWARD FUNCTIONS

Multivariate reward processes with reward functions of constant rates, defined on a semi-Markov process, were first studied by Masuda and Sumita (1991). Reward processes with nonlinear reward functions were introduced in Soltani (1996). In this work we study a multivariate reward process whose components are reward processes with nonlinear reward functions. The Laplace transform of the covar...



Journal:
  • IEEE Trans. Automat. Contr.

Volume: 46

Publication date: 2001